Nejaký text z čoho pozostavala praca alebo ustredna tema? Zadanie? kto je sponzor? Kroky?

1 Data Preparation

1.1 Data Understanding

1.1.1 Data Description Report

The initial data was provided in a comma-separated values file, and was loaded and processed using the R programming language. Dataset used in this analysis contains 2 353 012 observations and 24 variables. Out of the 24 variables, 13 are of factor datatype, 9 are numeric and 2 are dates. All columns from the initial data were converted to the correct datype according to the data description file, which was provided. Column “payed_ammount” was replaced by column “paid_amount”. Column “payment_date” originally contained some blank fields, which were subsequently filled in as NA.

Column name Description Type
contract_id Unique identificator of the contract Int
payment_order Order of the payment Int
due_date Payment deadline Date
payment_date Date of the payment Date
product_type Type of the product Factor
contract_status Contract status Factor
business_discount Business discount provided Factor
gender Gender Factor
marital_status Marital status Factor
number_of_children Number of children Int
number_other_product Number of other products Int
clients_phone T/F if the client filled in home phone Factor
client_mobile T/F if the client filled in mobile phone Factor
client_email T/F if the client filled in email address Factor
total_earnings Earning bucket Factor
birth_year Birth year of the client Int
birth_month Birth month of the client Int
living_area Region of the client home address Factor
different_contact_area T/F if the client filled different home and contact address Factor
kc_flag T/F if the client does not home local citizenship Factor
cf_val If the special measure during the underwriting was applied
kzmz_flag T/F if the client filled in employer Factor
due_amount Installment what should be payed Numeric
payed_amount What was payed at a certain date Numeric

1.1.1.1 Value ranges

Frequency, relative frequency and relative cumulative frequency were computed for each category in all categorical variables. All frequency tables can be located in the “frequencies” list.

1.1.1.2 Attribute correlations

m’lady

We computed correlation coefficients between all possible pairs of numeric variables, see Figure ??, and discovered strong positive correlation between due amount and paid amount. This could be due to the fact that in the event that the installment has already been paid, the due amount and the paid amount would assume the same value. Correlation between the remaining pairs of numeric variables was either nonexistent or negligible. Then, the significance of correlation between due amount and paid amount was tested using Pearson’s product moment correlation coefficient. The pair of attributes was found to be significantly correlated with a correlation coefficient of 0.76 and p-value less than 2.2e-16.

Relationship between categorical variables was tested using chi-squared test with the significance level of 0.05. All significantly correlated pairs of variables can be accessed in “categorical_rel” dataframe.

1.1.1.3 Basic statistics

Basic statistics computed for numeric variables can be located in Table @ref{tab:statistics}. Distribution of numeric and categorical variables was visualized using boxplots, density plots and histograms, see in Attachments, Figure 5.2 and Figure 5.1.

1.2 Data Exploration Report

NIECO HODIT

1.3 Data Quality Report

1.3.1 Data coverage

Next step consited of the data coverage analysis. As an example, we have chosen a couple of plots, that indicate interesting data distribution. We for example found out, that clients mostly order the product type 1, contracts are mostly in status number 1 or that most of the payments have a discount. We also discovered, that the marital status of the clients is mostly number 3 and they have most frequently no children. Clients also very frequently do not provide information about their earnings and they ordered usually 1 other product.

All the mentioned findings can be seen on visualizations on Figure ??. All the data coverage visualizations are available in the Attachments.

1.3.2 Missing values

Exploring the NA values in the dataset, we found out, that 4 attributes had almost same percentage of missing values, as can be seen on the statistics on Figure @ref(fig:Missing_stat). Attributes kc_flag, living_area, cf_val and different_contact_area have the most missing values, almost 20 %, whereas payment_order has around 3,5 % and payment_date and delay have the same percentage, almost 0,5 %.

Using a different visualization, that can be seen on Figure @ref(fig:Missing_complex), we discovered, that the four attributes with the highest percentage are not missing at random but almost all at the same time.

We found out, that contract_id together with payment_order were not creating a unique key of the data payment, but one payment was divided into multiple parts, which was also causing problem with NA values in the four attributes. Data in the four attributes were not copied into other parts of a payment, but were present in just the first payment part. We decided to unify the payment parts into only one payment by summarizing the paid amount of all the parts and using the payment_date of the last paid part. Thanks to the unification, the amount of NA values has markedly decreased.

Secondly, we dealt with the NA values in payment_order and payment_date. Since it was only less then 4 % of the dataset, and it was not possible to substitute the values, we decided to delete the rows.

## # A tibble: 25 x 3
##    variable               n_miss pct_miss
##    <chr>                   <int>    <dbl>
##  1 different_contact_area 471354   20.0  
##  2 cf_val                 469310   19.9  
##  3 living_area            469015   19.9  
##  4 kc_flag                468906   19.9  
##  5 payment_order           83361    3.54 
##  6 payment_date            11733    0.499
##  7 delay                   11733    0.499
##  8 contract_id                 0    0    
##  9 due_date                    0    0    
## 10 product_type                0    0    
## # ... with 15 more rows
\label{fig:Missing_complex}Distribution of missing values

(#fig:Missing_complex)Distribution of missing values

missing_values_plot

1.4 Feature engineering

We decided to add new features to create higher-accuracy models. First of all, we computed a numerical feature “delay” counting the difference between payment_date and due_date.

Since we are creating two classification models deciding whether a new payment will be delayed for more than 21 days or more than 140 days, we created 2 new binary features “delay_21_y” and “delay_140_y”.

We also created a new numerical feature “delay_indiv” counting the mean delay for the whole client´s history. We also computed 2 new numerical features, “delay_indiv_21” and “delay_indiv_140” counting cumulative number of delayed payments (21, 140 days) by one contract

Lastly, numerical features “mean_delay_1m” ,“mean_delay_3m”, “mean_delay_6m”, “mean_delay_12m” are computing the mean delay for 1/3/6/12 months for each contract.

1.5 Exploratory analysis of the new features

1.5.1 Data description report

Column name Description Type
delay Difference between payment_date and due_date Int
delay_21_y T/F if the delay is more than 21 days Factor
delay_140_y T/F if the delay is more than 140 days Int
delay_indiv Mean delay for the whole client’s history Int
delay_indiv_21 Cumulative sum of payments delayed for more than 21 days by contract Int
delay_indiv_140 Cumulative sum of the payments delayed for more than 140 days by contract Int
mean_delay_1m Average payment delay for the last month Int
mean_delay_3m Average payment delay for the last 3 months Int
mean_delay_6m Average payment delay for the last 6 months Int
mean_delay_12m Average payment delay for the last 12 months Int

1.5.2 Basic statistics

We computed basic statistics for each new attribute. All the visualizations can be seen on @ref(fig:stat_new_delay).

First of all we focused on attribute delay. Since it is probably the most important attribute of the nexly created, we analysed its distribution, distribution of delay on log scale DOPLNIT CO SME NASLI.
\label{fig:stat_new_delay}Basic statistics of the added attribute delay.

(#fig:stat_new_delay-1)Basic statistics of the added attribute delay.

## [[1]]
\label{fig:stat_new_delay}Basic statistics of the added attribute delay.

(#fig:stat_new_delay-2)Basic statistics of the added attribute delay.

## 
## [[2]]
## NULL
## 
## [[3]]
\label{fig:stat_new_delay}Basic statistics of the added attribute delay.

(#fig:stat_new_delay-3)Basic statistics of the added attribute delay.

## 
## [[4]]
\label{fig:stat_new_delay}Basic statistics of the added attribute delay.

(#fig:stat_new_delay-4)Basic statistics of the added attribute delay.

delay_boxplot We also computed some basic statistics for the other added attributes. DOPLNIT CO SME NASLI
\label{fig:stat_new}Basic statistics of the added attributes..

(#fig:stat_new)Basic statistics of the added attributes..

1.5.3 Missing values

As can be seen on Figure, newly-created features also contain NA values. The highest percentage of missing values has attribute mean_delay_12m, almost 55 %. Together with mean_delay_6m, mean_delay_3m and mean_delay_1m, they are the only new attributes holding NA attributes.

It is not surprising, that these attributes have the highest percentage of NAs, since they compute results only every 12/6/3/1 months.

We decided to replace the NA values by 0, so they can be later used in the modeling part.

## # A tibble: 34 x 3
##    variable               n_miss pct_miss
##    <chr>                   <int>    <dbl>
##  1 mean_delay_12m         804268    55.4 
##  2 mean_delay_6m          458935    31.6 
##  3 mean_delay_3m          249422    17.2 
##  4 mean_delay_1m           95489     6.58
##  5 contract_id                 0     0   
##  6 payment_order               0     0   
##  7 due_date                    0     0   
##  8 living_area                 0     0   
##  9 different_contact_area      0     0   
## 10 kc_flag                     0     0   
## # ... with 24 more rows

1.6 Advanced statistics

paid_amount_gender_prod amount_contract amount_earnings amount_product amount_earnings

dependence_delay_gender amount_gender amount_gender_type_discount delay_gender_

2 Modeling

2.1 Class

2.2 Prediction model (21+ days)

2.3 Prediction model (140+ days)

2.4 Estimation of the expected number of days of delay when the client triggers first action

3 Conclusion

4 Discussion

5 Attachments

\label{fig:density}Density plots.

Figure 5.1: Density plots.

\label{fig:boxplots}Boxplots for numeric attributes.

Figure 5.2: Boxplots for numeric attributes.